Relative predictive value of sociodemographic factors for chronic diseases among All of Us participants: a descriptive analysis

Background Although sociodemographic characteristics are associated with health disparities, the relative predictive value of different social and demographic factors remains largely unknown. This study aimed to describe the sociodemographic characteristics of All of Us participants and evaluate the predictive value of each factor for chronic diseases associated with high morbidity and mortality.
Methods We performed a cross-sectional analysis using de-identified survey data from the All of Us Research Program, which has collected social, demographic, and health information from adults living in the United States since May 2018. Sociodemographic data included self-reported age, sex, gender, sexual orientation, race/ethnicity, income, education, health insurance, primary care provider (PCP) status, and health literacy scores. We analyzed the self-reported prevalence of hypertension, coronary artery disease, any cancer, skin cancer, lung disease, diabetes, obesity, and chronic kidney disease. Finally, we assessed the relative importance of each sociodemographic factor for predicting each chronic disease using the adequacy index for each predictor from logistic regression.
Results Among the 372,050 participants in this analysis, the median age was 53 years, 59.8% reported female sex, and the most common racial/ethnic categories were White (54.0%), Black (19.9%), and Hispanic/Latino (16.7%). Participants who identified as Asian, Middle Eastern/North African, and White were the most likely to report annual incomes greater than $200,000, advanced degrees, and employer or union insurance, while participants who identified as Black, Hispanic, and Native Hawaiian/Pacific Islander were the most likely to report annual incomes less than $10,000, less than a high school education, and Medicaid insurance. We found that age was most predictive of hypertension, coronary artery disease, any cancer, skin cancer, diabetes, obesity, and chronic kidney disease. Insurance type was most predictive of lung disease. Notably, no two health conditions had the same order of importance for sociodemographic factors.
Conclusions Age was the best predictor for the assessed chronic diseases, but the relative predictive value of income, education, health insurance, PCP status, race/ethnicity, and sexual orientation was highly variable across health conditions. Identifying the sociodemographic groups with the largest disparities in a specific disease can guide future interventions to promote health equity.
Supplementary Information The online version contains supplementary material available at 10.1186/s12889-024-17834-1.

The consequences of social determinants are complex and dynamic, such that each may have a unique contribution to different diseases. For example, an analysis of patients from 169 community health centers across the United States found that physical environments (i.e., rural status and Census region) were strongly predictive of hypertension, but less predictive of cardiovascular disease [30]. That study identified health disparities among different sociodemographic groups in a low-income, uninsured population. However, no studies to date have examined the relative contribution of different social factors across health conditions in a nationally representative population.
In this study, we analyze social determinants of health that fall into priority domains established by Healthy People 2030: economic stability (annual income), education access and quality (educational attainment), and healthcare access and quality (insurance type, health literacy, and primary care history) [4]. Information related to the environment [4] and community support [4] domains was unavailable in the All of Us dataset at the time of analysis. We also assess sociodemographic factors commonly associated with health disparities, including age, sex, gender, sexual orientation, race, and ethnicity.
First, we describe how various sociodemographic factors (age, sex, gender, sexual orientation, annual income, educational attainment, insurance type, and health literacy) are distributed within self-identified racial and ethnic categories among participants in the All of Us Research Program. We then assess the relative contribution of each of these variables to the likelihood of self-reporting hypertension, coronary artery disease, cancer, lung disease, diabetes, obesity, and chronic kidney disease. We hypothesize that the strongest predictors will vary by disease and that a "one-size-fits-all" approach is insufficient for identifying and preventing health disparities.

Study population and data collection
All of Us aims to collect health information from over one million people in the United States, with the goal of advancing personalized health care [31]. All of Us is therefore uniquely poised to examine social diversity within self-identified racial and ethnic categories and the interactions between different sociodemographic factors and health conditions. The full All of Us Research Program protocol has been published previously [32]. Briefly, adults aged 18 years or older living in the United States were consented and enrolled in All of Us through a healthcare provider organization or directly through the website (www.allofus.nih.gov). Participants answered baseline demographic questions and had the option to provide additional information on their medical history. Surveys were completed in either English or Spanish, and resulting data were pooled as per the All of Us protocol [32].
For this study, we included sociodemographic and health information collected from All of Us participants who were enrolled from May 2017 to January 2022 and completed the Basics Survey. Sociodemographic data and health history were selected based on the health disparities literature and the data available in version 6 of All of Us. Age, sex, gender, sexual orientation, race, ethnicity, annual income, educational attainment, insurance type, health literacy, and history of seeing a primary care practitioner (PCP) within the last twelve months were included in the analysis. Self-reported health history included hypertension, coronary artery disease, any cancer, skin cancer, lung disease, diabetes, obesity, and chronic kidney disease. Health literacy was assessed via the three-question Brief Health Literacy Screen, with total scores ranging from 3 to 15 and higher scores indicating higher health literacy [33,34]. Supplemental Table 1 provides details on each question and how each characteristic was coded.
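As a minimal sketch of the health literacy scoring described above, the following assumes that each of the three Brief Health Literacy Screen items has already been coded on a 1-5 scale with 5 indicating the highest literacy. That coding is an assumption consistent with the 3-15 range; the actual item coding is given in Supplemental Table 1. The function and argument names are hypothetical.

```python
def bhls_score(confidence_forms, help_reading, difficulty_understanding):
    """Total Brief Health Literacy Screen score from three items.

    Assumption: each item is precoded 1-5, where 5 is the response
    indicating the highest health literacy, so the total spans 3-15.
    """
    items = (confidence_forms, help_reading, difficulty_understanding)
    for value in items:
        if value not in (1, 2, 3, 4, 5):
            raise ValueError("each item must be an integer from 1 to 5")
    return sum(items)
```

Under this assumed coding, a participant giving the most favorable response to all three questions would score 15, and one giving the least favorable response to all three would score 3.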

Data analysis
All data were analyzed directly in the All of Us Researcher Workbench with R Statistical Software (version 4.2.2) [35]. After data cleaning (Supplemental Tables 1.1 and 1.2), we compared the distributions of age, sex, gender, sexual orientation, annual income, educational attainment, insurance type, health literacy, and recent PCP visit by self-defined racial and ethnic category. We characterized sociodemographic disparities across racial and ethnic categories as these social constructs are tied to important determinants of health that cannot be directly measured in this study, such as racism. We also included participants who skipped or declined to answer a question. As described in Supplemental Table 1.1, participants were asked, "Which categories describe you? Select all that apply. Note, you may select more than one group." Participants who selected more than one group were coded as "Multiple." Sub-categories with 20 or fewer participants were excluded for participant privacy [36].
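The two coding rules in this paragraph can be illustrated with a short sketch. It is written in Python for illustration rather than in the R used for the analysis, and the function names, the use of "Skip" for an empty response, and the input shapes are all assumptions rather than details taken from the study code.

```python
def code_race_ethnicity(selections):
    """Collapse a 'select all that apply' response into one category.

    Assumption: `selections` is a list of the category labels a
    participant chose; an empty list is treated as a skipped question.
    """
    if not selections:
        return "Skip"
    if len(selections) > 1:
        return "Multiple"
    return selections[0]


def suppress_small_cells(counts, threshold=20):
    """Keep only sub-categories with more than `threshold` participants."""
    return {label: n for label, n in counts.items() if n > threshold}
```

For example, `code_race_ethnicity(["Asian", "White"])` yields `"Multiple"`, and a sub-category with exactly 20 participants is dropped by `suppress_small_cells` because the rule excludes cells of 20 or fewer.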
To assess the importance of each social determinant for each health condition, we used logistic regression to compare the relative importance of each predictor (ranked from 1-11) based on its adequacy index (-2*log-likelihood of each predictor in the model divided by -2*log-likelihood of the full model) among participants who provided baseline self-reported information on a personal history of hypertension, coronary artery disease, any cancer, skin cancer, lung disease, diabetes, obesity, and chronic kidney disease (Table 1) [37,38]. We used inverse probability weighting to weight each participant who reported clinical conditions (n = 141,878) such that the weighted sample (n = 378,811) was representative of the full 372,050 participants in All of Us version 6 based on the measured social and demographic variables, excluding PCP status due to a large number of version 6 participants who did not fill out the All of Us Healthcare Access and Utilization survey [39,40]. In all models, age and health literacy were modeled as restricted cubic splines with five knots [37,38]. As health literacy mediates some racial and ethnic disparities in health outcomes and behaviors [41][42][43][44][45][46][47], we analyzed the interaction between health literacy and race/ethnicity. As we were interested in the variance explained by each sociodemographic variable, not the specific impact of each category within each social variable, "Skip" and "Prefer Not To Answer" were included as subcategories for categorical variables. Participants with missing or unknown health outcomes in the weighted sample were excluded from the model for that chronic disease. Code is available at www.github.com/ansleykunnath/allofus.
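The adequacy index described above can be illustrated with a short sketch. It assumes that the "-2*log-likelihood" quantities are likelihood ratio statistics measured against an intercept-only model, and that `loglik_subset` is the log-likelihood of a model restricted to a single predictor; both are one possible reading of the description, and the authors' R code may compute the index differently. The function names and log-likelihood values are hypothetical and chosen only for illustration.

```python
def lr_statistic(loglik_null, loglik_model):
    """Likelihood ratio statistic of a model against the null model."""
    return -2.0 * (loglik_null - loglik_model)


def adequacy_index(loglik_null, loglik_subset, loglik_full):
    """Share of the full model's likelihood ratio statistic that the
    subset model achieves (assumed definition, see lead-in)."""
    full = lr_statistic(loglik_null, loglik_full)
    if full <= 0:
        raise ValueError("full model must improve on the null model")
    return lr_statistic(loglik_null, loglik_subset) / full


def rank_predictors(loglik_null, loglik_full, subset_logliks):
    """Rank predictors from highest to lowest adequacy index."""
    indices = {
        name: adequacy_index(loglik_null, loglik, loglik_full)
        for name, loglik in subset_logliks.items()
    }
    ordered = sorted(indices, key=indices.get, reverse=True)
    return [(rank, name, indices[name])
            for rank, name in enumerate(ordered, start=1)]


# Illustrative log-likelihoods only; these are not study results.
example = rank_predictors(
    loglik_null=-1000.0,
    loglik_full=-800.0,
    subset_logliks={"age": -850.0, "income": -950.0, "insurance": -900.0},
)
```

With these made-up inputs the full model's statistic is 400, so the single-predictor models reach indices of 0.75, 0.50, and 0.25 and are ranked in that order. Repeating the ranking separately for each health condition gives one ordering per condition, which is what allows the orderings to be compared across diseases.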

Results
All of Us (version 6) included social and demographic data from 372,050 individuals, summarized in Table 1. The participants had a median age of 53 years, 59.8% were female, and 86.6% identified as straight. There were ten self-identified racial/ethnic categories, with the largest categories being White (54.0%), Black (19.9%), and Hispanic/Latino (16.7%). Incomes ranged from less than $10,000 (14.2%) to over $200,000 (6.2%), with 13% preferring not to answer. Most participants completed at least a high school degree or equivalent, with 43% of participants completing a college or advanced degree. Thirty percent of participants were insured by their employer, 17.9% by Medicaid, and 15.8% by Medicare. Most participants (62.8%) reported feeling extremely confident when completing medical forms. Similarly, 60% and 66% of participants reported never needing assistance with reading health-related materials and never having difficulty understanding written health information, respectively. Nearly all participants (94.5%) who completed the Healthcare Access and Utilization instrument reported seeing a PCP within the last 12 months. The prevalence of health conditions ranged from 2.4% (chronic kidney disease) to 31.3% (hypertension) among participants who reported their health history.
Participants who identified as Asian, Middle Eastern/North African, and White were the most likely to report incomes greater than $200,000 per year (11.9%, 9.4%, 9.5%), advanced degrees (40.3%, 36.5%, 29.7%), and employer/union insurance (50.6%, 38.6%, 38.0%), whereas participants who identified as Black, Hispanic, and Native Hawaiian/Pacific Islander were the most likely to report incomes less than $10,000 per year (32.9%, 17.2%, 16.9%), less than a high school education (15.1%, 25.8%, 6.4%), and Medicaid insurance (32.3%, 31.8%, 28.1%). Participants who identified as Black, Hispanic, or Native Hawaiian/Pacific Islander were also more likely to mark "prefer not to answer". There was substantial income, educational, and insurance diversity within each racial/ethnic category as well (Fig. 1 and Supplemental Table 2). All racial/ethnic categories self-reported high health literacy (Supplemental Fig. 1). Those who skipped the race/ethnicity question were also more likely to skip all other demographic questions. Annual household income was the question most frequently skipped (7.8%) or marked "prefer not to answer" (13.0%). Racial/ethnic groups demonstrated large differences in self-reported disease prevalence (Supplemental Table 3). For example, among participants who reported clinical histories, Black participants were the most likely to report hypertension (45.1%), and Asian participants were the least likely to report obesity (9.0%).
Inverse probability weighting generated a pseudo-population very similar to the full All of Us (version 6) population (Table 2 and Supplemental Fig. 2). In general, age, income, education, insurance, and race/ethnicity were the most important predictors across all assessed self-reported health conditions (Fig. 2). However, each health condition had a different order of relative importance for the sociodemographic factors. For most health conditions, age was the most important predictor. However, age was the 5th most important predictor for a self-reported history of lung disease, for which health insurance type was relatively more important. We also found that income was a better predictor of self-reported obesity than of the other health conditions. Across most diseases, health literacy was the least predictive variable. Overall, the relative predictive value of sociodemographic factors varied greatly among chronic health conditions.
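The idea behind the weighting step can be sketched in a few lines. This simplified version assumes the probability of reporting clinical conditions is estimated as the observed response proportion within strata of a single categorical variable, whereas the study weighted on several measured social and demographic variables, so it should be read as an illustration of the principle rather than the authors' procedure. The data and names are hypothetical.

```python
from collections import Counter


def inverse_probability_weights(strata, responded):
    """Weight each responder by 1 / P(responded | stratum).

    Non-responders receive a weight of 0 because they contribute no
    outcome data. The response probability is estimated as the share
    of responders within each stratum (a simplifying assumption).
    """
    totals = Counter(strata)
    responders = Counter(s for s, r in zip(strata, responded) if r)
    weights = []
    for stratum, did_respond in zip(strata, responded):
        if did_respond:
            weights.append(totals[stratum] / responders[stratum])
        else:
            weights.append(0.0)
    return weights


# Toy data: stratum A has 4 people (2 responders), B has 6 (1 responder).
strata = ["A"] * 4 + ["B"] * 6
responded = [True, True, False, False,
             True, False, False, False, False, False]
weights = inverse_probability_weights(strata, responded)
```

In this toy example the weighted responders sum to the full sample size of 10, and stratum A accounts for 4 of those 10 weighted participants, matching its share of the full sample. Comparing weighted and unweighted distributions in this way is the kind of check that indicates whether a pseudo-population resembles the full cohort.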

Discussion
In this study, we analyzed the relative contribution of sociodemographic factors to chronic diseases in a large, diverse national sample. Age was the strongest predictor of self-reporting every assessed health condition except lung disease. Despite evidence that health literacy is a strong predictor of chronic diseases and health care utilization [48][49][50][51][52], we found that health literacy was the overall weakest predictor of chronic diseases among All of Us participants. This may be due, in part, to the lack of response variability within this cohort or response bias inherent to self-reporting.
Previous studies also suggest that the Brief Health Literacy Screen is more sensitive to identifying patients with inadequate health literacy than marginal health literacy [34]. To our knowledge, this study is the largest population in which the Brief Health Literacy Screen has been used; further studies may be necessary to understand its predictive value for chronic diseases. Overall, the differences in the relative contribution of social and demographic factors to each chronic disease underscore the importance of carefully selecting covariates when assessing disease risk and prevention. Furthermore, identifying the strongest predictors for diseases will be crucial for developing targeted interventions to prevent health disparities.
The All of Us participants encompass a wide range of demographic factors, social identities, and health conditions. Previous databases have been limited by a lack of diversity among study participants, leading to the exclusion of many marginalized groups in research. The UK Biobank contains data from over 500,000 individuals, 94% of whom identify as White [53,54]. Similarly, the original Framingham Heart Study was 100% White and, despite the addition of new cohorts, 94% of current participants are White [55]. Furthermore, despite the presence of individuals with diverse backgrounds within these databases, many researchers exclude non-White races in their studies due to low sampling. All of Us, on the other hand, seeks to improve health research diversity by actively including participants from groups historically excluded from research, thereby strengthening its use for health disparities research [31]. Based on 2021 United States Census Bureau data, a representative sample of Americans should be about 59% non-Hispanic White [56]. In our study, we found that 54% of All of Us participants identified as White, which is considerably closer to the national population than the other databases. The rich diversity and scale of All of Us make it a powerful tool for studying health conditions across various social variables, including race/ethnicity.
There are several limitations to this study. The data were self-reported, which introduces measurement error that likely differs by baseline sociodemographic characteristics, and cross-sectional, which prevents any causal interpretations from our predictive models. Participants in the All of Us Research Program were enrolled through their health provider organization or online, which may explain the disproportionately high rate of health literacy (89.9%) and PCP status (94.5%). Furthermore, to appropriately compare the predictive value of sociodemographic factors across health conditions, we did not change the model for each health condition (e.g., changing the interaction terms to fit any particular condition) or account for factors such as family history or health-related behaviors (e.g., smoking, diet, exercise) specific to each health condition. Despite the large dataset, some health conditions were reported by a small number of participants, which limited which interactions we could model. For example, while we thought the interaction between income and self-reported race/ethnicity was likely pertinent to several of the available health conditions, the fact that both were categorical variables with many categories prevented model convergence. Several of the assessed sociodemographic factors likely interact with each other, and the cumulative effect of multiple sociodemographic factors may be greater than the sum of individual sociodemographic factors. Finally, the All of Us database is, unfortunately, missing important covariates that likely impact health outcomes, including, but not limited to: experiences with discrimination and racism [57,58], psychosocial stress [59,60], environmental exposures [61], food security [62], exposures to gentrification [63,64], and interactions with the justice system [65,66].

Conclusion
In this study, we characterize the differences in sociodemographic factors and chronic diseases among racial and ethnic groups, as well as the relative predictive value of sociodemographic factors for chronic diseases, using the All of Us database. Our findings demonstrate that the All of Us Research Program is well-poised to expand the diversity of population-level health outcome research in the United States. Finally, our predictive models, although missing factors that measure structural drivers of health, highlight that social and demographic factors are differentially predictive of individual health conditions and, therefore, underscore the importance of thoughtful model generation that considers each health condition individually. Identifying the strongest predictors for each of these diseases can also guide strategies to eliminate health disparities.

Fig. 1
Fig. 1 Relationship between race/ethnicity category and age, gender, annual income, health insurance, education, and PCP status. "Skip" indicates participants who filled out the corresponding All of Us instruments (Supplemental Tables 1.1 and 1.2) but skipped a question on the variable, and "Missing" indicates participants who did not see the instrument for any reason

Fig. 2
Fig. 2 Relative importance of selected social and demographic variables on health conditions based on adequacy index. The values reflect the proportion of health condition variance (column) explained by each sociodemographic factor (row)

Table 1
All of Us (version 6) Participant Information

Table 2
Predictive model characteristics
rcs restricted cubic spline, NA not applicable, ":" denotes interaction term, PCP primary care provider